Productionizing R scripts in the cloud

Gergely Daróczi

April 6, 2019

Intro

Every Data Science project starts with … ETL

  • Have you ever written an R script to be run in a non-interactive way?
  • Have you ever scheduled an R script to run without human intervention?
  • Did it work?
  • Do you have any R script running in production?
  • … with central logging, proper monitoring and alerting?
  • … run on cheap spot-instances with real-time performance metrics collected and feeding an AI picking the optimal instance type for the next run?

Every BI consulting firm has developed its own …

  • job scheduler
  • data repository and metadata documentation tool
  • central logging
  • and a few other things …

Today we can just pick the right open-source tool, such as

  • Jenkins
  • Airflow or Luigi
  • cronR and taskscheduleR R packages
  • cloud services (e.g. AWS Batch, CloudWatch, Lambda)

Source: thecodinglove.com

Clean code!

Code style

It can be anything … just be consistent!

Extra hints:

DRY (don't repeat yourself)

Document and potentially open-source your code

  • Use roxygen2, potentially with markdown
  • Run R CMD check frequently to check on your docs
  • Pick a permissive license
  • Build your R package homepage with pkgdown

Don’t get confused …

… use version control!

git 101

Start from scratch:

Contribute to an existing project:
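
The two workflows above can be sketched roughly as follows. This is a minimal illustration rather than the commands from the original slides: the repository names, the file `job.R`, the branch name and the demo identity are all hypothetical, and a local bare clone stands in for a shared remote such as GitHub.

```shell
set -eu

# Work in a throwaway directory so the sketch does not touch real projects
workdir=$(mktemp -d)
cd "$workdir"

# Identity used for the demo commits only (hypothetical values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# --- Start from scratch ---
git init -q etl-scripts
cd etl-scripts
echo 'message("hello from a scheduled job")' > job.R
git add job.R
git commit -q -m "Add first ETL job"
cd ..

# --- Contribute to an existing project ---
# A bare clone stands in for the shared remote (assumption for this demo)
git clone -q --bare etl-scripts upstream.git
git clone -q upstream.git contribution
cd contribution
git checkout -q -b fix-typo            # work on a topic branch
echo '# TODO: add logging' >> job.R
git commit -q -am "Note missing logging"
git push -q origin fix-typo            # publish the branch for review
```

With a hosted remote, the last step would typically be followed by opening a pull request from the pushed branch.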

When I desperately search in the logs why the app crashed

logger: a lightweight, modern and flexible logging utility for R

cat demo.R

Colorized output

The Anatomy of a Log Request

Custom formatter / layout / appender

More about logging

https://daroczig.github.io/logger

Database connections

Why secure database credentials?

“When I woke up the next morning, I had four emails and a missed phone call from Amazon AWS – something about 140 servers running on my AWS account, mining Bitcoin.” – Andrew Hoffman

Loading MySQL configuration from global options

But how to set those global options on the server?

Loading MySQL configuration from the keyring

Great for the single-desktop R user, but how to make use of it on a remote server?

Using a pre-configured Data Source Name

But we still need someone to set up / deploy configuration.

Using a MySQL configuration file

But we still need to set up ~/.my.cnf:
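
One way to create such a file is sketched below. The helper name `write_my_cnf` and the host, user and password values are placeholders invented for this example; only the `[client]` group and the `host` / `user` / `password` keys are standard MySQL option-file syntax. The demo writes to a temporary directory instead of the real home directory.

```shell
# Write a MySQL option file that only the owner can read.
# Values come from MYSQL_* variables when set, else from placeholders.
write_my_cnf() {
  target=$1
  if [ -e "$target" ]; then
    echo "refusing to overwrite $target" >&2
    return 1
  fi
  (
    # Restrict permissions before the file exists,
    # so the password is never readable by other users
    umask 077
    cat > "$target" <<EOF
[client]
host=${MYSQL_HOST:-localhost}
user=${MYSQL_USER:-etl_user}
password=${MYSQL_PASSWORD:-change-me}
EOF
  )
}

# Demo against a temporary directory; on a real server the target
# would be "$HOME/.my.cnf"
demo_dir=$(mktemp -d)
write_my_cnf "$demo_dir/.my.cnf"
ls -l "$demo_dir/.my.cnf"
```

The file still has to be created on every server that runs the job, which is the deployment problem the slide points at.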

And then .pgpass etc as well.

Loading MySQL configuration from more general, custom files

Again, how to get those unencrypted RData files to the server?

Loading MySQL configuration from encrypted custom files

But how to get the private key to a new server?

Loading MySQL configuration from environment variables
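
On the shell side, a job wrapper can check that the expected variables are present before R starts, so a misconfigured server fails fast instead of halfway through an ETL run. The sketch below is an assumption-laden illustration: the `MYSQL_*` variable names and the `require_env` helper are hypothetical, and inside R the values would be read with `Sys.getenv()`.

```shell
# Return non-zero (and name the culprit) if any listed variable is unset or empty
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      return 1
    fi
  done
}

# Placeholder values for the demo; in production these would be injected
# by the scheduler or a secrets store, not hard-coded in a script
export MYSQL_HOST=localhost MYSQL_USER=etl_user MYSQL_PASSWORD=change-me

require_env MYSQL_HOST MYSQL_USER MYSQL_PASSWORD && echo "database settings present"
```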

Loading MySQL configuration from config files v1

With the below YAML config:

Loading MySQL configuration from config files v2

With the below YAML config:

But again, we have to get the YAML file to the server in a secure way :/

Amazon KMS for passwords

Source: AWS Encryption SDK

Amazon KMS for data larger than 4 KB

Source: AWS Encryption SDK

Loading MySQL configuration from config files v3

With the below YAML config:

Examples

Storing secrets

Examples

Scheduling jobs

Installing Jenkins

  1. Install Jenkins from https://pkg.jenkins.io/debian-stable/

     # Trust the Jenkins repository signing key
     wget -q -O - https://pkg.jenkins.io/debian-stable/jenkins.io.key | sudo apt-key add -
     # Register the Jenkins apt repository
     echo "deb https://pkg.jenkins.io/debian-stable binary/" | sudo tee -a /etc/apt/sources.list
     sudo apt update
     # Jenkins needs a Java runtime
     sudo apt install openjdk-8-jdk-headless jenkins
     # Verify that Jenkins (a Java process) is listening
     sudo netstat -tapen | grep java

  2. Open up port 8080 in the related security group / firewall
  3. Access Jenkins from your browser and finish installation

    1. Read the initial admin password:

       sudo cat /var/lib/jenkins/secrets/initialAdminPassword

    2. Proceed with installing the suggested plugins
    3. Create your first user(s)

Scheduling jobs via cron

crontab.guru

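H => hash: Jenkins replaces H with a value derived from a hash of the job name, so jobs are spread over the period instead of all starting at the same minute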
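
The idea behind the hashed schedule can be illustrated with a few lines of shell. This is not how Jenkins computes the value; it is a stand-in that uses `cksum` to show the principle of a stable, job-specific minute. The helper `hashed_minute` and the job names are made up for the example.

```shell
# Map a job name to a stable minute in the range 0-59
hashed_minute() {
  job_name=$1
  # cksum prints "<checksum> <byte count>"; keep only the checksum
  checksum=$(printf '%s' "$job_name" | cksum | cut -d' ' -f1)
  echo $(( checksum % 60 ))
}

# Print hourly crontab-style entries with a different minute per job
for job in etl-daily report-weekly cleanup; do
  printf '%s * * * * # %s\n' "$(hashed_minute "$job")" "$job"
done
```

The same job name always yields the same minute, so the schedule stays predictable while different jobs rarely collide.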

Jenkins output

rotation

sync to CloudWatch

Jenkins plugins

  • extended email (edit subject / body, attachments, inline images, even from Docker)
  • Slack notification

Reproducible jobs

  1. example Dockerfile
  2. MRAN
  3. Docker Hub
  4. example docker run command
  5. docker-run wrapper
  6. spot instances
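
A docker-run wrapper along the lines of items 4 and 5 might look like the sketch below. Everything specific in it is an assumption for illustration: the function name `run_r_job`, the `/jobs` mount point, the `MYSQL_*` variables passed through, and the default image `rocker/r-ver:3.5.3` (any image with a pinned R version would do). Setting `DRY_RUN=1` prints the command instead of running it, which is handy on machines without Docker.

```shell
# Run an R script from the current directory inside a pinned R image
run_r_job() {
  script=$1
  image=${R_IMAGE:-rocker/r-ver:3.5.3}   # assumed default image

  # --rm    remove the container when the job ends
  # -v      mount the current directory read-only at /jobs
  # -e VAR  pass the variable through from the host environment
  set -- docker run --rm \
    -v "$(pwd):/jobs:ro" \
    -e MYSQL_USER -e MYSQL_PASSWORD \
    "$image" Rscript "/jobs/$script"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Show the command that would be executed for a hypothetical etl.R
DRY_RUN=1 run_r_job etl.R
```

Keeping the image tag in one place means a scheduler job only needs to call `run_r_job some-script.R`, and every run uses the same R version and package snapshot.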

https://stackoverflow.com/questions/8175912/load-multiple-packages-at-once/8176099#8176099

Later on

https://github.com/rajgoel/reveal.js-plugins/tree/master/embed-tweet

https://twitter.com/daroczig/status/675342535267495936

https://twitter.com/daroczig/status/682966258904510464